THIS DOCUMENT IS DEPRECATED AND IS FOR DRAFTING USE ONLY

CHoRUS Broad Standard Operating Procedures for Data Contributing Sites

Purpose

This document helps data contributing sites understand what is expected of them to fulfill contractual obligations and what assistance they can expect from each module. It describes the types of activities leading to the production of harmonized structured and unstructured EHR data, imaging data, and monitor data.

Structured EHR data:

  • Identify the source of the data

Some sites have access to an existing repository, while other sites need to formulate a data extraction request to an authorized entity (such as a data group within their CTSI).

  • Formulate the specs of the data extraction

The specification includes:

(i) a description of the population of patients

(ii) the scope of the extraction, which specifies what data are needed over what period of time.

In the pilot phase (April-June 2023), the request is for data from hospital admission to hospital discharge. The time when these patients were admitted does not matter, nor do their admitting diagnoses. We would prefer admission dates to be post-COVID.

For subsequent phases, the number of encounters to be contributed will be decided on a site-by-site basis. The specs also describe which data elements are needed. We require that sites provide all Tier 1 data elements (“Flowsheet and Tier categories” sheet, column H), and encourage the contribution of Tier 2 or lower data elements whenever available.

Table 1. EHR domains, specific variables, and corresponding OMOP clinical tables (CDM 5.4)

EHR domain | Specific variables | OMOP clinical tables (CDM 5.4)
Person | Age, race, ethnicity, death (Y/N), death date, discharge destination, address | PERSON, DEATH
Visit & location | Dates of all location changes, type of location | OBSERVATION_PERIOD, VISIT_OCCURRENCE, VISIT_DETAIL
Measurements and assessments | Labs, vital signs, inputs/outputs, transfusions, other Tier 1 flowsheet elements | MEASUREMENT, OBSERVATION
Diagnoses and problems | Admission diagnosis, discharge diagnoses | CONDITION_OCCURRENCE
Procedures | X-rays, dialysis, CRRT, EGD, EEG, bronchoscopies, specific surgeries (most of these will have a corresponding note/report) | PROCEDURE_OCCURRENCE
Devices | Endotracheal tubes, central lines, pacemakers, EVD, etc. | DEVICE_EXPOSURE
Drugs | Instances of drug administration (as opposed to drug orders) | DRUG_EXPOSURE
Specimens | Cultures, pathology, cytology, etc. | SPECIMEN

Table 1 provides a summary of this list and shows how data elements should be assigned across OMOP tables. It is understood that not all data are available for the duration of the hospital admission.
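Where helpful, the domain-to-table assignments in Table 1 can be encoded directly in your ETL. Below is a minimal Python sketch; the domain labels used as keys are assumptions to adapt to your extraction spec.

    # Domain-to-table routing per Table 1 (OMOP CDM 5.4). The domain labels
    # used as dictionary keys are assumptions; adapt them to your own spec.
    DOMAIN_TO_OMOP_TABLES = {
        "person": ["PERSON", "DEATH"],
        "visit & location": ["OBSERVATION_PERIOD", "VISIT_OCCURRENCE", "VISIT_DETAIL"],
        "measurements and assessments": ["MEASUREMENT", "OBSERVATION"],
        "diagnoses and problems": ["CONDITION_OCCURRENCE"],
        "procedures": ["PROCEDURE_OCCURRENCE"],
        "devices": ["DEVICE_EXPOSURE"],
        "drugs": ["DRUG_EXPOSURE"],
        "specimens": ["SPECIMEN"],
    }

    def target_tables(domain: str) -> list[str]:
        """Return the OMOP tables a source domain maps to, per Table 1."""
        return DOMAIN_TO_OMOP_TABLES.get(domain.strip().lower(), [])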

  • Identify relevant metadata

Please include here metadata that can fundamentally change the interpretation of the data above: for example, the make/model of the bedside monitor/ventilator, the infusion pump system used, how blood pressure and temperature are typically measured (instrument and anatomical location), and the POC glucose device most commonly used. These metadata may be difficult to collect, and there is not yet a codified system for them. One way a site might record them is sketched below.
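As one illustration only (there is no codified schema), these metadata could be kept in a simple machine-readable structure. Every key and value below is a placeholder.

    # Hypothetical site metadata record; all field names and values are
    # placeholders, since no codified system for these metadata exists yet.
    site_metadata = {
        "bedside_monitor": {"make": "<manufacturer>", "model": "<model>"},
        "ventilator": {"make": "<manufacturer>", "model": "<model>"},
        "infusion_pump_system": "<system name>",
        "blood_pressure": {"instrument": "<instrument>", "anatomical_location": "<site>"},
        "temperature": {"instrument": "<instrument>", "anatomical_location": "<site>"},
        "poc_glucose_device": "<most commonly used device>",
    }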

  • Process the request

Your data team will execute the data extraction, or these specs will be forwarded to the designated group within your institution. Please document when data are received.

  • Process the data - deidentification

If this has not been taken care of during extraction, please develop a process for the de-identification of all personally identifiable information. Your IRB might have allowed preserving elements of dates. This step will require your data team to have honest broker privileges and to create various "crosswalk" or mapping files so that the data can ultimately be remapped for future purposes. The research team does not have access to those maps, but they may be used, for example, to link to long-term outcomes (SSDMF) or to tools extracting geographic determinants of health using full addresses or zip codes. A minimal sketch of such a crosswalk follows.
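The sketch below assumes source data keyed by MRN and a per-patient random date shift; the file layout and column names are assumptions, and the resulting file must remain with the honest broker, outside the research enclave.

    import csv
    import random

    def build_crosswalk(mrns, max_shift_days=365):
        """Assign each MRN a new person_id and a random per-patient date shift."""
        shuffled = list(mrns)
        random.shuffle(shuffled)  # so assigned IDs do not reveal source ordering
        return {
            mrn: {"person_id": new_id, "shift_days": random.randint(1, max_shift_days)}
            for new_id, mrn in enumerate(shuffled, start=1)
        }

    def save_crosswalk(crosswalk, path="crosswalk.csv"):
        """Persist the mapping for future re-linkage (e.g., SSDMF outcomes)."""
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["mrn", "person_id", "shift_days"])
            for mrn, row in crosswalk.items():
                writer.writerow([mrn, row["person_id"], row["shift_days"]])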

  • Process the data – Quality

Please develop a process to verify that each piece of data requested is included in the data you receive, as well as the degree of missingness of the variables requested. CHoRUS recommends that you develop a process, following the recommendations of the Standards Module expert, to map your data extraction to standard OMOP tables using tools of your choice. The OHDSI PERSEUS toolkit is recommended, as it includes a fully-fledged Data Quality Dashboard in addition to assisting in the customization of the Extraction-Transformation-Loading (ETL) process for your own extract. Sites are expected to run OHDSI's DQD and ARES tools on their OMOP tables prior to upload. A lightweight completeness check is sketched below.
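As a complement to DQD/ARES (not a replacement), a simple missingness check might look like the following. The table and column names are an illustrative subset of the Tier 1 expectations, and the one-CSV-per-table layout is an assumption.

    import pandas as pd

    # Illustrative subset only; extend to every requested Tier 1 element.
    REQUIRED_COLUMNS = {
        "MEASUREMENT": ["person_id", "measurement_concept_id",
                        "measurement_datetime", "value_as_number"],
        "DRUG_EXPOSURE": ["person_id", "drug_concept_id",
                          "drug_exposure_start_datetime"],
    }

    for table, cols in REQUIRED_COLUMNS.items():
        df = pd.read_csv(f"{table}.csv")  # assumes one CSV per OMOP table
        absent = [c for c in cols if c not in df.columns]
        if absent:
            print(f"{table}: missing columns {absent}")
            continue
        print(f"{table}: fraction of missing values per required column")
        print(df[cols].isna().mean().to_string())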

  • Upload the extract to the CHoRUS cloud

Please upload all data tables and metadata files to your site-specific cloud enclave within the CHoRUS cloud environment. Full specifications of this process are included in the UPLOAD SOP. A link to this SOP is provided in the parent document.

These steps are for guidance only, yet they provide a map of the typical steps necessary to produce a quality data extraction. Your site's process will have variations on these steps. A description of those variations should also accompany your extract, so CHoRUS fully understands the process at each site and is in the best possible position to help local teams.

Unstructured EHR data

As specified in the exploratory calls with sites, CHoRUS is requesting a tokenized version of the notes, not the notes or reports themselves. Here "tokens" refer to clinical entities (e.g., diagnoses, procedures, medications) mentioned in the text, along with associated metadata (e.g., the document ID, the OMOP Athena concept code associated with the clinical entity, negation, and certainty).

  • Final deliverable

The expected final product is the NOTE_NLP table of the OMOP CDM 5.4 Common Data Model. The specific output should follow that standard.

Please note that although CHoRUS is not requiring actual notes, only the NOTE table includes linkage information to the PERSON table. Thus, sites also have to produce a NOTE table including, at a minimum, the person_id and note_id fields so the NOTE_NLP table can be appropriately linked to the PERSON table. A minimal sketch of this linkage follows.
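The sketch below uses made-up IDs; the concept_ids are examples, and the final assertion is a pre-upload check that every NOTE_NLP row resolves to a NOTE row.

    import pandas as pd

    # Minimal NOTE table: person_id and note_id are the required linkage fields.
    note = pd.DataFrame({"note_id": [1, 2], "person_id": [1001, 1002]})

    # NOTE_NLP rows point back to NOTE through note_id (concept_ids are examples).
    note_nlp = pd.DataFrame({
        "note_nlp_id": [1, 2, 3],
        "note_id": [1, 1, 2],
        "note_nlp_concept_id": [255848, 316139, 201826],
    })

    # Pre-upload check: every note_id in NOTE_NLP must exist in NOTE.
    orphans = set(note_nlp["note_id"]) - set(note["note_id"])
    assert not orphans, f"NOTE_NLP references unknown note_ids: {orphans}"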

  • Initial extraction

The initial extraction will be site dependent but will follow a general process. A list of patients will be generated from the structured EHR data, and each admission has corresponding admission and discharge dates. The initial step is to request/perform an extraction of all reports and notes generated during the visit corresponding to these dates. The types of notes/reports to generate pertain to this visit.

Finer-grained concept_ids are defined in the OMOP LOINC vocabulary for consults from specific services. Feel free to use these finer concept_ids.

No specific OMOP concept_ids exist for these types of reports. Your site might have adopted a different concept_id; please share it through the discussions.

  • De-identification (Optional)

The second step in the process is to produce a de-identified version of the UEHR data. The necessity of this step is dictated by local data governance rules and is therefore site-dependent. The process of de-identification may be transparent to investigators, as institutional data scientists may already provide a de-identified version of all UEHR. More likely, though, it will be the investigators' task to produce this de-identified version. De-identification is the process of producing a version of the note devoid of PHI. If elements of dates are preserved, the result is a "limited" dataset. If elements of dates are not preserved but dates are time shifted, the resulting dataset is "safe harbor". This is a general definition of these terms and is relevant to all data, not only UEHR.

Investigators are encouraged to find out whether there is an institution-favored method of de-identifying text, or an institutional policy pertaining to verifying the results of text de-identification. We note that CHoRUS is not requesting de-identified text. Free software to de-identify text is available; examples include NLM Scrubber, UCSF's Philter, and Microsoft's Presidio. As these approaches are computational/algorithmic, additional evaluation is required. Such an evaluation would need gold-standard annotation of a subset of notes (the i2b2 de-identification task would be a good reference for such an annotation task) and a comparison of results. In the absence of a specific institutional policy, it is generally recommended that at least 25 notes of each type be manually reviewed to verify the results of the de-identification method; one way to draw such a sample is sketched below.
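The sketch below assumes the de-identified notes sit in a CSV with note_id, note_type, and note_text columns; these names are assumptions about your local notes table.

    import pandas as pd

    notes = pd.read_csv("deidentified_notes.csv")  # note_id, note_type, note_text

    # At least 25 notes per type (or all of them when fewer than 25 exist).
    sample = (
        notes.groupby("note_type", group_keys=False)
             .apply(lambda g: g.sample(n=min(25, len(g)), random_state=0))
    )
    sample[["note_id", "note_type"]].to_csv("manual_review_sample.csv", index=False)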

At this stage, it is important to become familiar with the OMOP NOTE and NOTE_NLP table structures. Every de-identified note will require, at the very least, a unique note_id linked with a date, a person, and an occurrence, as well as a note type and possibly subtypes. Thus, in the process of de-identification, enough information must be preserved to link each note with the other data domains.

Importantly, the basic contract that data generation sites executed with CHoRUS specified that a limited dataset, with elements of dates preserved, should be delivered to CHoRUS. However, many institutions will not allow this. Please communicate to CHoRUS whether your contract specifies, and whether you will be delivering, a limited or safe harbor dataset. See additional information on this topic under the structured EHR rubric.

  • Producing standard concepts (tokens) from notes – the OHNLP pipeline

The ultimate UEHR data deliverable to CHoRUS is an OMOP NOTE_NLP table which, for each note_id, will possibly contain a large number of concepts. There is an excellent series of office hours presented by Andrew Wen on the installation of the OHNLP software and the verification of its performance. You are strongly encouraged to carefully review these lectures, as well as the provided examples. For orientation, the target row shape is sketched below.
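The sketch below shapes a generic extracted entity into a NOTE_NLP-style row. The input record layout is an assumption for illustration; the OHNLP office hours describe the pipeline's actual output format.

    from datetime import date

    # Illustrative pipeline output; field names and values are assumptions.
    extracted_entities = [
        {"note_id": 1, "text_span": "pneumonia",
         "omop_concept_id": 255848, "negated": False},
    ]

    def to_note_nlp_row(note_nlp_id, rec):
        """Shape one extracted clinical entity into a NOTE_NLP-style row."""
        return {
            "note_nlp_id": note_nlp_id,
            "note_id": rec["note_id"],                      # links back to NOTE
            "lexical_variant": rec["text_span"],            # surface form in the note
            "note_nlp_concept_id": rec["omop_concept_id"],  # standard (Athena) concept
            "term_exists": "N" if rec["negated"] else "Y",  # negation metadata
            "nlp_date": date.today(),
        }

    rows = [to_note_nlp_row(i, r) for i, r in enumerate(extracted_entities, start=1)]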

Waveform data

Overview

Figure 1. Generalized Device Data Acquisition Process

Device data acquisition, standardization, and ingestion into a Bridge2AI OMOP data warehouse can involve two types of data: (a) high frequency signals, such as multi-lead electrocardiogram, EEG, arterial blood pressure, respiratory, and other vitals commonly produced by patient monitors, ventilators, and other physiological signal sensing devices at the patient bedside (reported at greater than 1 sample per second); and (b) low frequency data, including trending physiological measurements and EHR captured/documented measurements (hourly blood pressure readings, alarms, alarm settings, etc.). High frequency signals are often captured through dedicated device gateway integration solutions via direct serial or networked data capture. Low frequency device data are more often captured through HL7 transactions and manual EHR documentation by clinicians and nursing staff.

  • Device Meta Data

Two sets of device metadata will be needed from participating B2AI data acquisition sites, representing high frequency data and low frequency data. For high frequency data, the metadata may include the following:

Table 1. High Frequency Device Meta Data (per signal to be contributed)

Meta Data | Description | Examples | Notes
Label | Physiological signal label available for capture through serial or network connections | "II ECG", "respiratory rate", "heart rate", etc. | Labels are not assumed to comply with any particular standard
Source | Device identifier: manufacturer, model, department, floor, bed; UDI if available | Philips, Drager MediBus, AviaA, Respironics Esprit V60 | Source data can facilitate linking of the device to EHR records
Sampling Rate | Sampling frequency of the physiological signal | 62 sps, 125 sps, 250 sps, 500 sps, 4000 sps | Respiratory, blood pressure, and other low-frequency signals are typically sampled under 250 sps. EEG, ECG, ABP, and other signals for which morphology analysis is common are sampled at 125 sps or greater.
Amplitude Resolution | Amplitude resolution | In terms of analog-to-digital converter (A/D) units, uV, mV, mmHg, beats per minute, breaths per minute | If A/D units, additional A/D metadata will be required to convert A/D units to standard clinical units
Baseline | Signal baseline value (virtual zero) | In terms of A/D units | Normalized signals are adjusted to absolute zero in standard units
Amplitude Offset | Signal offset value | In terms of A/D units | Normalized signals are adjusted to absolute zero in standard units
Gain | A/D scale factor | Integer or double value | Additional A/D gain applied to the raw signal; used to convert A/D units to standard units of measure (see the sketch after this table)
Length | Monitoring length | ms, seconds, minutes, hours | Total monitoring length. Data may include periods of leads off, artifacts, etc.
Quality | Overall signal quality | Low, Moderate, High | Site algorithm or methods used to assess and label signal quality. Quality may be associated with time periods.
Annotations | Indicate whether signal annotations are provided | Beat location, artifact location, quality annotations | May be generated at the data staging step (Fig. 1)
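As a sketch of how the Baseline, Amplitude Offset, and Gain metadata are used together: the exact formula is device-specific, so the convention below (a common WFDB-style one) is an assumption.

    def adc_to_physical(adc_value, baseline, offset, gain):
        """Convert one raw A/D sample to standard clinical units (e.g., mV)."""
        return (adc_value - baseline - offset) / gain

    # Example: a sample of 1224 A/D units with baseline 1024, offset 0, and a
    # gain of 200 A/D units per mV corresponds to 1.0 mV.
    print(adc_to_physical(1224, baseline=1024, offset=0, gain=200))  # 1.0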

For low frequency data, metadata may be derived from the manufacturer's documentation. Manufacturers of physiological medical devices provide a data dictionary or specification document listing the signals and features reported by the device and transmitted within HL7 transactions. Data dictionary labels are often used to label OBX segments within an HL7 transaction. Data capture and mapping to OMOP can use these device data dictionaries to form the feature-to-OMOP ETL mapping document (see example MIMIC-IV, Appendix A; Philips iX, Appendix B); a sketch of this mapping step is given below.
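The sketch below uses a hypothetical data dictionary entry and a hand-built OBX segment; the label, segment, and concept_id are illustrative assumptions.

    # Hypothetical device data dictionary entry mapped to an OMOP concept.
    DEVICE_DICTIONARY = {
        "HR": {"measurement_concept_id": 3027018, "unit": "beats/min"},  # example
    }

    def parse_obx(segment: str) -> dict:
        """Map one pipe-delimited OBX segment to a measurement-style row."""
        fields = segment.split("|")
        label = fields[3].split("^")[0]  # OBX-3: observation identifier
        value = float(fields[5])         # OBX-5: observation value
        mapping = DEVICE_DICTIONARY[label]
        return {
            "measurement_concept_id": mapping["measurement_concept_id"],
            "value_as_number": value,
            "unit_source_value": mapping["unit"],
        }

    print(parse_obx("OBX|1|NM|HR^Heart Rate||72|/min|||||F"))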

Table 2. Low Frequency Device Meta Data (per signal to be contributed)

Meta Data | Description | Examples | Notes
Label | Physiological signal label available for capture through serial or network connections | "II ECG", "respiratory rate", "heart rate", "Alarm x" | Labels are not assumed to comply with any particular standard
Source | Device identifier: manufacturer, model, department, floor, bed; UDI if available | Philips, Drager MediBus, AviaA, Respironics Esprit V60 | Source data can facilitate linking of the device to EHR records
Sampling Rate | Sampling frequency of the feature | Per min, per hour, per day, date/time | If not continuous, sampling rate will not apply and measurements/observations will use the HL7 date/time reported in the OBX segment
Amplitude Resolution | Amplitude resolution | uV, mV, mmHg, beats per minute, breaths per minute | Assumes standard clinical units

Imaging data

Data Processing Steps

  1. Identify data extraction partners and methods (e.g., CTSI)
  2. Identify the sources of the data (e.g., PACS, site research repository)
  3. Determine which image modalities would need to be extracted, date range, and size
  4. Confirm the type of data format you will receive (e.g., DICOM)
  5. Confirm the type of IDs used to link this data to the patient (e.g., name) or other hospital ID (e.g., MRN)
  6. Map relevant local metadata to CHoRUS metadata using the OMOP format
  7. Identify relevant metadata not covered by CHoRUS
  8. De-identify the data and create a cross-walk to PHI (see the sketch after this list)
  9. Review data quality (e.g., data missingness)
  10. Upload the imaging data, along with all data tables and metadata files, to your site-specific cloud enclave within the CHoRUS cloud environment
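A minimal sketch of the header-level scrubbing in step 8, assuming pydicom is available. The tag list is illustrative and NOT a complete de-identification profile; follow your IRB-approved process (e.g., the DICOM PS3.15 confidentiality profiles).

    import pydicom  # assumed available for reading/writing DICOM headers

    # Illustrative direct identifiers only; a real profile covers many more tags.
    TAGS_TO_BLANK = ["PatientName", "PatientBirthDate", "PatientAddress"]

    def blank_identifiers(path_in, path_out, new_patient_id):
        """Blank a few direct identifiers and swap in the cross-walk ID."""
        ds = pydicom.dcmread(path_in)
        for tag in TAGS_TO_BLANK:
            if tag in ds:
                setattr(ds, tag, "")
        ds.PatientID = new_patient_id  # replaced via the PHI cross-walk
        ds.save_as(path_out)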

Imaging Metadata OMOP CDM

Files will be organized hierarchically by patient, modality, study_UID, and series_UID. A sketch of this organization follows.
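The sketch below produces that hierarchy from a flat directory of DICOM files, assuming pydicom is available; the source and destination paths are placeholders.

    import shutil
    from pathlib import Path

    import pydicom  # assumed available for reading DICOM headers

    src, dst = Path("incoming_dicom"), Path("organized_dicom")  # placeholders
    for path in src.rglob("*.dcm"):
        ds = pydicom.dcmread(path, stop_before_pixels=True)
        target = (dst / str(ds.PatientID) / str(ds.Modality)
                      / str(ds.StudyInstanceUID) / str(ds.SeriesInstanceUID))
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target / path.name)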

  • Data Summary by Modality

We are asking for a summary of radiology occurrence count and radiology image count by imaging modality (e.g., CT, MR). Within those groups, we would like a breakdown by manufacturer and body part imaged (e.g., brain, chest); one way to produce this summary is sketched below.
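The sketch below assumes a study-level index extracted from the DICOM headers with the columns shown; the file and column names are assumptions.

    import pandas as pd

    # Assumed columns: modality, manufacturer, body_part, image_count.
    studies = pd.read_csv("study_index.csv")

    occurrence_counts = studies.groupby("modality").size()
    image_counts = studies.groupby("modality")["image_count"].sum()
    breakdown = studies.groupby(["modality", "manufacturer", "body_part"]).size()

    print(occurrence_counts, image_counts, breakdown, sep="\n\n")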

Table X. Radiology Occurrence

Field | Required | Data type | Description
image_occurrence_id (PK) | Yes | integer | The unique key given to an imaging study record (often referred to as the accession number or imaging order number).
person_id (FK) | Yes | integer | The person_id of the Person for whom the procedure is recorded. This may be a system-generated code.
procedure_occurrence_id (FK) | Yes | integer | The unique key given to a procedure record for a person. Links to the Procedure_occurrence table.
visit_occurrence_id (FK) | No | integer | The unique key given to the visit record for a person. Links to the Visit_occurrence table.
anatomic_site_concept_id (FK) | No | integer | Anatomical location of the imaging procedure by the medical acquisition device (gross anatomy). It maps the ANATOMIC_SITE_SOURCE_VALUE to a Standard Concept in the Spec Anatomic Site domain. This should be coded at the lowest level of granularity.
wadors_uri | No | varchar(max) | A Web Access to DICOM Objects via RESTful Services Uniform Resource Identifier at the study level.
local_path | Yes | varchar(max) | Universal Naming Convention (UNC) path to the folder containing the image object files, accessed via a storage block access protocol (e.g., \\Server\Directory).
image_occurrence_date | Yes | date | The date the imaging procedure occurred.
image_study_UID | Yes | varchar(250) | DICOM Study UID.
image_series_UID | Yes | varchar(250) | DICOM Series UID.
modality | Yes | varchar(250) | DICOM-defined value (e.g., US, CT, MR, PT, DR, CR, NM).

Table X. Radiology Image

Field | Required | Data type | Description
image_feature_id (PK) | Yes | integer | The unique key given to an imaging feature.
person_id (FK) | Yes | integer | The person_id of the Person for whom the procedure is recorded. This may be a system-generated code.
image_occurrence_id (FK) | Yes | integer | The unique key of the Image_occurrence table.
table_concept_id | Yes | integer | The concept_id of the domain table in which the feature is stored (Measurement, Observation, etc.). This concept should be used with the table_row_id.
table_row_id | Yes | integer | The row_id of the domain table in which the feature is stored.
image_feature_concept_id | Yes | integer | Concept_id from a standard vocabulary, often LOINC or RadLex, for image features.
image_feature_type_concept_id | Yes | integer | This field can be used to determine the provenance of the imaging features (e.g., DICOM SR, algorithms used on images).
image_finding_concept_id | No | integer | RadLex or other terms for the groupings of image features (e.g., nodule).
image_finding_num | No | integer | Integer for linking related image features. It should not be interpreted as an order of clinical relevance.
anatomic_site_concept_id | No | integer | The site on the body where the feature was found. It maps the ANATOMIC_SITE_SOURCE_VALUE to a Standard Concept in the Spec Anatomic Site domain.
alg_system | No | varchar(max) | URI of the algorithm that extracted the features, including version information.
alg_datetime | No | datetime | The date and time of the algorithm processing.